An Approach to Computing Predicted Purchase - Part 3/3
Background:
If you work with transactional data, you have probably faced the challenge of computing meaningful KPIs that quantify how promising a given customer is or will be (based on past data, of course). One computation that comes in handy when evaluating customers is the predicted purchase probability, i.e., a value that allows you to assess, from a marketing perspective, whether it might be worth investing in this customer.
Given the data, the predicted purchase is a probabilistic quantification of the number of future purchases for a given customer.
from IPython.display import Image
PATH = "/home/julien/data_blog/post_material/predictive_purchase/"
Image(filename = PATH + "purchase_history.png", width=500, height=500)
Now there are many ways one could proceed to compute the probability of purchase. In this post, I will use one method that I find pretty elegant, discussed in a paper published by Fader and colleagues in 2005.
Important note
While I used this model in a business project, and it produced useful results that went into production, I am not an expert in financial computations, so take this for what it is: an honest attempt to compute a useful metric!
This post is structured in two separate sections:
1) generating transactional data randomly, and making a few observations about what influences the fit of your model
2) using real transactional data from a CD retailer (feeling some nostalgia there)
Import libraries
We first start by importing the required libraries and setting a few notebook parameters that I like to use. If you do not have these libraries installed, simply install them in your environment using pip install.
import os
import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import plotnine as p9
import pandas as pd
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)
display_settings = {
    'max_columns': 10,
    'expand_frame_repr': True,  # Wrap to multiple pages
    'max_rows': 50,
    'precision': 2,
    'show_dimensions': True
}
for op, value in display_settings.items():
    pd.set_option("display.{}".format(op), value)
from IPython.display import clear_output
import timeit
# Set figure size for easier reading
plt.rcParams['figure.figsize'] = (12,6)
Import a typical transactional dataset
from lifetimes.datasets import load_dataset
df = load_dataset(
    'CDNOW_sample.txt',
    header=None,
    names=['customer_id', 'customer_index', 'transaction_date', 'quantity', 'amount'],
    delim_whitespace=True,
    # converters={'transaction_date': lambda d: pd.to_datetime(d, format='%Y%m%d')}
).drop(columns=['customer_index'])
df.head(2)
This is a transactional dataset from a business allowing continuous purchase of items (CDs). We have access to three data features: (i) date of transaction, (ii) quantity purchased and (iii) customer ID. That is all you need to get started.
In order to train the model, you need to boil the data down to a customer-level dataframe with three parameters:
1) frequency: number of repeat purchases for that customer.
2) recency: time between the first and last purchase, in days, weeks or any meaningful scale for your business.
3) T: age of the customer (birth = first purchase)
This would be a so-called RFM table (Recency-Frequency-Monetary; note that I will be discussing the monetary component in a separate post).
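Before using the utility function below, it may help to see how these three quantities fall out of a raw transaction log. A minimal pandas sketch (the `id` and `date` column names and the toy dates are illustrative; note that frequency counts *repeat* purchases, so a one-time buyer has frequency 0):

```python
import pandas as pd

# Toy transaction log; 'id' and 'date' are illustrative column names
tx = pd.DataFrame({
    'id':   [1, 1, 1, 2],
    'date': pd.to_datetime(['2024-01-01', '2024-01-15', '2024-02-01', '2024-01-10']),
})

end = tx['date'].max()  # end of the observation period
rfm_manual = tx.groupby('id')['date'].agg(first='min', last='max', n='count')
rfm_manual['frequency'] = rfm_manual['n'] - 1                      # repeat purchases
rfm_manual['recency'] = (rfm_manual['last'] - rfm_manual['first']).dt.days
rfm_manual['T'] = (end - rfm_manual['first']).dt.days              # customer "age"
print(rfm_manual[['frequency', 'recency', 'T']])
```

Customer 1 made three purchases over 31 days (frequency 2, recency 31), while customer 2 bought only once (frequency 0, recency 0).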
from lifetimes.datasets import load_transaction_data
from lifetimes.utils import summary_data_from_transaction_data
transaction_data = load_transaction_data()
rfm = summary_data_from_transaction_data(transaction_data, customer_id_col='id', datetime_col='date')
rfm.sample(3)
We can see that customer #2446 made two repeat purchases, that their first purchase happened 332 days ago, and that 177 days elapsed between their first and last purchase.
The drawing below depicts the meaning of each parameter in the RFM table.
PATH = "/home/julien/data_blog/post_material/predictive_purchase/"
Image(filename = PATH + "frequency_recency.png", width=500, height=500)
We can have a quick look at these features and see whether the distributions make sense.
pd.plotting.scatter_matrix(rfm, hist_kwds={'bins':50});
A few observations:
(i) As expected from the data features, we can see that the majority of customers purchased very few times (frequency).
(ii) As a result, the recency feature has a major peak at 0, which contains all customers who purchased only once from that business.
(iii) The business had some stability in acquiring new customers, as suggested by the uniformity of the age (T) distribution.
This was an example of how transactional data might look.
Now, a very simple and easy-to-implement approach would be to calculate a so-called RFM score at the customer level. Each data feature is categorized according to its distribution's quartiles, which yields a score for each customer (in our case only an RF score, since we do not have access to monetary value data).
quantiles = rfm.quantile(q=[0.25,0.5,0.75])
quantiles = quantiles.to_dict()
segmented_rfm = rfm.copy()
def RScore(x, p, d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]:
        return 3
    else:
        return 4

def FMScore(x, p, d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]:
        return 2
    else:
        return 1
# Categorize data points according to quartile
segmented_rfm = segmented_rfm.assign(
    r_quartile=segmented_rfm['recency'].apply(RScore, args=('recency', quantiles)),
    f_quartile=segmented_rfm['frequency'].apply(FMScore, args=('frequency', quantiles)),
)
# Create RF score per customer
segmented_rfm['RFMScore'] = (segmented_rfm.r_quartile.map(str)
                             + segmented_rfm.f_quartile.map(str))
segmented_rfm.head()
While there is some virtue to this method (interpretability, simplicity and speed of implementation, for instance), one might wonder whether it is wise to assume that past behavior maps onto future behavior that straightforwardly.
Instead, we could use machine learning to predict purchases based on past purchase patterns for each customer.
Pareto/NBD vs BetaGeo/NBD
From a historical perspective, the Pareto/NBD model has been an attractive choice for computing predicted purchases in different business models. However, its complexity made it hard for businesses to implement for regular updates of predicted purchases. As an alternative, Fader and colleagues proposed the BetaGeo/NBD (BG/NBD) model, which produces an arguably comparable fit to the Pareto/NBD but is simpler to implement.
We will first focus on the BG/NBD model and test its implementation on a public dataset. In the follow-up post, I will use the Pareto/NBD model on the same dataset, and explore it from another perspective using hierarchical Bayesian modeling to predict purchases.
from lifetimes import BetaGeoFitter
bgf = BetaGeoFitter(penalizer_coef=0.0)
Assumptions of the BG/NBD model (Fader et al., 2005)
(i) While active, the number of transactions made by a customer follows a Poisson process with transaction rate $\lambda$. This is equivalent to assuming that the time between transactions is exponentially distributed with rate $\lambda$.
(ii) Heterogeneity in $\lambda$ follows a gamma distribution:
\begin{equation*} f(\lambda \mid r, \alpha) = \frac{\alpha^r \lambda^{r-1} e^{-\lambda \alpha}}{\Gamma(r)}, \quad \lambda > 0 \end{equation*}
(iii) After any transaction, a customer becomes inactive with probability $p$. Therefore the point at which the customer “drops out” is distributed across transactions according to a (shifted) geometric distribution:
\begin{equation*} P(\text{inactive immediately after the } j\text{th transaction}) = p(1-p)^{j-1}, \quad j = 1, 2, 3, \ldots \end{equation*}
(iv) Heterogeneity in $p$ follows a beta distribution:
\begin{equation*} f(p \mid a, b) = \frac{p^{a-1} (1-p)^{b-1}}{B(a, b)}, \quad 0 \leq p \leq 1 \end{equation*}
(v) The transaction rate $\lambda$ and the dropout probability $p$ vary independently across customers.
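These assumptions define a generative process that we can simulate directly, which is a good way to build intuition. A minimal sketch (the hyperparameter values r, alpha, a, b and the 52-week horizon are arbitrary choices for illustration, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_customer(r=0.25, alpha=4.0, a=0.8, b=2.5, T=52.0):
    """Simulate one customer's repeat purchases under assumptions (i)-(v)."""
    # (ii) heterogeneity in lambda: gamma draw (guarded against underflow)
    lam = max(rng.gamma(shape=r, scale=1.0 / alpha), 1e-12)
    p = rng.beta(a, b)                        # (iv) heterogeneity in p
    t, purchases = 0.0, []
    while True:
        t += rng.exponential(1.0 / lam)       # (i) exponential inter-purchase times
        if t > T:
            break
        purchases.append(t)
        if rng.random() < p:                  # (iii) geometric dropout after a purchase
            break
    return purchases

sims = [simulate_customer() for _ in range(1000)]
print(np.mean([len(s) for s in sims]))  # average number of repeat transactions
```

Running this a few times with different hyperparameters makes the roles of the four parameters tangible: r and alpha shape the purchase-rate distribution, a and b the dropout probability.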
Let's check whether the model has acceptable performance in predicting data. To do so, we will train the model on a training set and ask it to predict data from a test set (AKA the validation / holdout period).
You could do the split yourself using scikit-learn or a manual function, but Lifetimes provides a utility function to split the original transactional data (which is often the format available at companies).
from lifetimes.utils import calibration_and_holdout_data
transaction_data = load_transaction_data()
calibration_period_end='2014-09-01'
observation_period_end='2014-12-31'
summary_cal_holdout = calibration_and_holdout_data(transaction_data, 'id', 'date',
calibration_period_end=calibration_period_end,
observation_period_end=observation_period_end)
summary_cal_holdout.head(2)
# Fit model to the data
bgf.fit(summary_cal_holdout['frequency_cal'],
summary_cal_holdout['recency_cal'],
summary_cal_holdout['T_cal']
)
From the output above, we see that the function 1) splits the data into two sets and 2) produces a so-called RF table.
We run a standard diagnostic, plotting the average number of transactions in the test set versus the training set, for observed ("frequency_holdout") and predicted ("model predictions") data.
from lifetimes.plotting import plot_calibration_purchases_vs_holdout_purchases
plot_calibration_purchases_vs_holdout_purchases(bgf, summary_cal_holdout)
# Age in training set, in days (summary_cal_holdout uses daily units by default)
age_train = np.mean(summary_cal_holdout.T_cal)
# Age in the test set is the length of the holdout period, in days
age_test = (pd.to_datetime(observation_period_end)
            - pd.to_datetime(calibration_period_end)).days
# X axis limit from observed output
xlim = [0, 7]
x_axis_no_model = np.linspace(xlim[0], xlim[1], 100)
no_model_line = (age_test / age_train) * x_axis_no_model
plt.plot(x_axis_no_model, no_model_line, linestyle='--', color='black', label='no model')
We can see that the model is quite good at picking up the purchase dynamics.
Most customers make a low number of purchases, which is probably why we get a better fit at lower bins compared to higher bins.
Importantly, if we estimated future purchases without a model (or rather, using a simple heuristic), we would overestimate purchases, which would mislead the business going forward.
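To make the heuristic concrete: it scales calibration-period purchases by the ratio of holdout to calibration duration, implicitly assuming every customer keeps purchasing at their historical rate forever. A toy example (the numbers are made up):

```python
age_train = 40.0   # mean customer age in the calibration period, in weeks
age_test = 20.0    # length of the holdout period, in weeks
cal_purchases = 4  # purchases a customer made during calibration

# "No model" forecast: assume the purchase rate simply carries over
naive_forecast = (age_test / age_train) * cal_purchases
print(naive_forecast)  # → 2.0
```

Because it ignores dropout, a customer who has already churned still receives a positive forecast, which is exactly why this baseline over-forecasts.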
Now that we have validated the model using a validation set, let's train the model on the whole dataset.
from lifetimes import BetaGeoFitter
bgf = BetaGeoFitter(penalizer_coef=0.0)
bgf.fit(rfm['frequency'], rfm['recency'], rfm['T'])
print(bgf)
# Model summary
bgf.summary
Lifetimes provides some nice matrix plots that allow you to visualize the probability of customers being alive given the parameters, along with other features of your dataset. I won't use them here, but I recommend exploring these plots; they can add value when explaining the process to stakeholders.
Now that our model has been validated and fitted on the whole dataset, we can go on and predict the expected number of purchases for each customer in the dataset, over a given time window t:
# Predicted purchases are calculated for a given time window t (here, 12 weeks)
t = 12
rfm['pred_purch_coming_week=' + str(t)] = (
    bgf.conditional_expected_number_of_purchases_up_to_time(
        t,
        rfm['frequency'],
        rfm['recency'],
        rfm['T']
    )
)
# Plot the predicted-purchase histogram
rfm['pred_purch_coming_week=' + str(t)].plot(kind='hist', bins=50)
plt.xlabel('Expected number of purchases in coming weeks=' + str(t))
plt.title('Customer count per expected number of purchases');
This is another approach, presented at PyData Seattle 2017 by Van Dyke & Gauthier, who use a Bayesian approach "to approximate the full posterior of the model parameters, and gain a nice understanding of the uncertainty around our estimates, not just the average case".
# import theano
from pymc3.math import exp, log

class ParetoNBD(pm.Continuous):
    """
    Custom distribution class for the Pareto/NBD likelihood.
    """
    def __init__(self, lambda_, mu, *args, **kwargs):
        super(ParetoNBD, self).__init__(*args, **kwargs)
        self.lambda_ = lambda_
        self.mu = mu

    def logp(self, x, t_x, T):
        """
        Log-likelihood for an individual customer's purchasing rate \lambda
        and lifetime \mu given their frequency, recency and time since first purchase.
        """
        log_lambda = log(self.lambda_)
        log_mu = log(self.mu)
        mu_plus_lambda = self.lambda_ + self.mu
        log_mu_plus_lambda = log(mu_plus_lambda)
        p_1 = x * log_lambda + log_mu - log_mu_plus_lambda - t_x * mu_plus_lambda
        p_2 = (x + 1) * log_lambda - log_mu_plus_lambda - T * mu_plus_lambda
        return log(exp(p_1) + exp(p_2))
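Before handing this log-likelihood to PyMC3, we can sanity-check it with plain NumPy: exponentiating `p_1` and `p_2` and summing should reproduce the two-term Pareto/NBD individual likelihood written directly (the parameter values below are arbitrary):

```python
import numpy as np

lam, mu, x, t_x, T = 0.8, 0.1, 3, 20.0, 30.0

# Direct form: L = lam^x * mu / (lam+mu) * e^{-(lam+mu) t_x}
#                + lam^{x+1} / (lam+mu) * e^{-(lam+mu) T}
direct = (lam**x * mu / (lam + mu) * np.exp(-(lam + mu) * t_x)
          + lam**(x + 1) / (lam + mu) * np.exp(-(lam + mu) * T))

# Log-space form used in ParetoNBD.logp
p_1 = x * np.log(lam) + np.log(mu) - np.log(lam + mu) - t_x * (lam + mu)
p_2 = (x + 1) * np.log(lam) - np.log(lam + mu) - T * (lam + mu)
log_space = np.exp(p_1) + np.exp(p_2)

print(np.isclose(direct, log_space))  # → True
```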
# Extract data for the model, following the notation of Fader/Hardie
N = rfm.shape[0]  # number of customers
x = rfm['frequency'].values
t_x = rfm['recency'].values
T = rfm['T'].values  # length of training period
n_draws = 2000
pnbd_model = pm.Model()
with pnbd_model:
    # Uninformative priors on model hyperparameters; see Polson and Scott,
    # https://projecteuclid.org/download/pdfview_1/euclid.ba/1354024466
    r = pm.HalfCauchy('r', beta=2)
    alpha = pm.HalfCauchy('alpha', beta=2)
    s = pm.HalfCauchy('s', beta=2)
    beta = pm.HalfCauchy('beta', beta=2)
    # Gamma prior on the purchasing rate parameter lambda
    lambda_ = pm.Gamma('lambda', alpha=r, beta=alpha, shape=N, testval=np.random.rand(N))
    # Gamma prior on the lifetime parameter mu
    mu = pm.Gamma('mu', alpha=s, beta=beta, shape=N, testval=np.random.rand(N))
    # Custom distribution for the Pareto/NBD likelihood function
    loglikelihood = ParetoNBD('loglikelihood', mu=mu, lambda_=lambda_,
                              observed={'x': x, 't_x': t_x, 'T': T})
    # Sample the model
    trace = pm.sample(n_draws, init=None)
# Traceplots to check for convergence
_ = pm.traceplot(trace, varnames=['r','alpha','s','beta'])
def prob_alive_at_T(lambda_, mu, t_x, T):
    """
    Probability that a customer's time of death \tau occurs after T.

    Pr(\tau > T | \lambda, \mu, x, t_x, T) = \dfrac{1}{1 + \dfrac{\mu}{\mu+\lambda}
    \big\{e^{(\lambda+\mu)(T-t_x)} - 1\big\}}

    See expression (7) in the technical appendix of Abe.

    :param lambda_: lambda parameter at the customer level
    :type lambda_: scalar
    :param mu: mu parameter at the customer level
    :type mu: scalar
    :param t_x: recency of transactions
    :type t_x: float
    :param T: duration of the calibration/training period
    :type T: float
    :return: probability of being alive at time T.
    """
    den = 1 + (mu / (lambda_ + mu)) * (np.exp((lambda_ + mu) * (T - t_x)) - 1)
    return 1 / den
Cumulative expected purchases for customer in period (T, T+t)
def likelihood(lambda_, mu, x, t, T):
    """Pareto/NBD likelihood function.

    :param lambda_: lambda parameter at the customer level
    :param mu: mu parameter at the customer level
    :param x: number of repeat transactions
    :param t: recency
    :param T: length of the calibration/training period
    :return: likelihood value.
    """
    p1 = x * np.log(lambda_) + np.log(mu) - np.log(mu + lambda_) - t * (mu + lambda_)
    p2 = (x + 1) * np.log(lambda_) - np.log(mu + lambda_) - T * (mu + lambda_)
    return np.exp(p1) + np.exp(p2)
def predict(t, lambda_, mu, x, tx, T):
    """Conditional expected purchases in the period (T, T+t).

    Used to assess holdout-period performance and to make predictions
    for a future time period t, conditional on purchase history (x, t_x, T).

    E[X(T, T+t) | \lambda, \mu, x, t_x, T] = \dfrac{1}{L(\lambda, \mu | x, t_x, T)}
    \times \dfrac{\lambda^{x+1}}{\mu} e^{-(\lambda+\mu)T} (1 - e^{-\mu t})

    http://brucehardie.com/notes/034/corr_Pareto-NBD.pdf

    :param t: time period
    :type t: scalar
    :param lambda_: lambda parameter at the customer level
    :type lambda_: scalar
    :param mu: mu parameter at the customer level
    :type mu: scalar
    :param x: number of repeat transactions
    :type x: integer
    :param tx: recency of transactions
    :type tx: float
    :param T: duration of the calibration/training period
    :type T: float
    :return: expected number of purchases (scalar)
    """
    like = likelihood(lambda_, mu, x, tx, T)
    p2 = lambda_ ** (x + 1) / mu * np.exp(-(lambda_ + mu) * T) * (1 - np.exp(-mu * t))
    return 1 / like * p2
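Two sanity checks follow directly from the formula: with t = 0 the factor (1 - e^{-mu t}) vanishes, so the conditional expectation is zero, and the expectation grows monotonically with t. A standalone check (the `likelihood` and `predict` functions are restated so the snippet runs on its own; parameter values are arbitrary):

```python
import numpy as np

def likelihood(lambda_, mu, x, t, T):
    # Pareto/NBD individual likelihood, as defined above
    p1 = x * np.log(lambda_) + np.log(mu) - np.log(mu + lambda_) - t * (mu + lambda_)
    p2 = (x + 1) * np.log(lambda_) - np.log(mu + lambda_) - T * (mu + lambda_)
    return np.exp(p1) + np.exp(p2)

def predict(t, lambda_, mu, x, tx, T):
    # Conditional expected purchases in (T, T+t), as defined above
    like = likelihood(lambda_, mu, x, tx, T)
    p2 = lambda_ ** (x + 1) / mu * np.exp(-(lambda_ + mu) * T) * (1 - np.exp(-mu * t))
    return 1 / like * p2

lam, mu, x, t_x, T = 0.8, 0.1, 3, 20.0, 30.0
print(predict(0, lam, mu, x, t_x, T))                 # → 0.0
preds = [predict(t, lam, mu, x, t_x, T) for t in range(1, 40)]
print(all(a < b for a, b in zip(preds, preds[1:])))   # → True
```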
_ = pm.plot_posterior(trace, varnames=['r','alpha','s', 'beta'])
customer_index = 150
# show purchasing behavior
rfm.iloc[customer_index]
lambda_post = trace['lambda']
mu_post = trace['mu']
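The notebook output for this customer is not shown here, but a sketch of how the posterior draws combine with `prob_alive_at_T` might look as follows (with simulated gamma draws standing in for `lambda_post[:, customer_index]` and `mu_post[:, customer_index]` so the snippet runs standalone; all parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_alive_at_T(lambda_, mu, t_x, T):
    # P(alive at T), as defined above
    den = 1 + (mu / (lambda_ + mu)) * (np.exp((lambda_ + mu) * (T - t_x)) - 1)
    return 1 / den

# Stand-ins for one customer's posterior draws
lambda_draws = rng.gamma(2.0, 0.05, size=2000)   # purchase-rate draws
mu_draws = rng.gamma(2.0, 0.02, size=2000)       # dropout-rate draws
t_x, T = 30.0, 39.0                              # recency and age, in weeks

# Evaluating over all draws gives a full posterior for P(alive),
# not just a point estimate
p_alive = prob_alive_at_T(lambda_draws, mu_draws, t_x, T)
print(p_alive.mean())
```

This is where the Bayesian approach pays off: beyond the mean, the spread of `p_alive` quantifies how certain we are that the customer is still active.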
We underforecast the number of purchases for this customer in the holdout period, but we correctly identified that the customer was still alive.
Holdout Period Predictions for Entire Customer Cohort
To get a picture of model performance for the entire cohort, we predict holdout-period purchases for every customer and compare them with the observed counts.
# holdout period is 39 weeks
t = 39
# predictions are size of customer base x number of draws
holdout_predictions = np.empty([N, n_draws])
for i in np.arange(N):
    holdout_predictions[i] = predict(
        t,
        lambda_post[:, i],
        mu_post[:, i],
        x[i],
        t_x[i],
        T[i]
    )
# predictions are posterior mean
rfm['frequency_predicted'] = holdout_predictions.mean(axis=1)
In the holdout period we predicted 1,724 purchases and observed 1,788.
rfm[['frequency_holdout','frequency_predicted']].sum()
The plot below shows the average purchases in the holdout period grouped by the number of repeat purchases made in the calibration period.
xlim=(1, 20)
mean_frequencies = rfm.groupby('frequency_cal')[['frequency_holdout',
'frequency_predicted']].mean()
mean_frequencies.rename(columns={'frequency_holdout': 'Observed',
'frequency_predicted': 'Model'},
inplace=True)
mean_frequencies.plot(kind='line',
title='Conditional Expected Purchases in Holdout Period', figsize=(16, 8))
# Generate a dummy model with holdout freq = t_holdout/t_calib
t_calib = np.mean(rfm['T_cal'])
t_holdout = t
x_heuristics = np.linspace(xlim[0],xlim[1],100)
heuristics = t_holdout/t_calib * x_heuristics
plt.plot(x_heuristics, heuristics, linestyle='--', color='black', label='Simple Heuristics')
plt.legend(loc=0)
plt.xlim(xlim)
plt.ylabel('Mean Number of Transactions in Holdout Period')
plt.xlabel('Number of Transactions in Calibration Period')
plt.grid(False)
plt.show()
from sklearn.metrics import mean_squared_error
print("RMSE: %s" % np.sqrt(mean_squared_error(rfm['frequency_holdout'], rfm['frequency_predicted'])))
# Select distributions of lambda and mu for a customer
lambda_individual = lambda_post[:,customer_index]
mu_individual = mu_post[:,customer_index]
# predict purchases for the user at t = 39
t = 39
predict(t, lambda_individual, mu_individual, x[customer_index], t_x[customer_index], T[customer_index]).mean()